Analysis of the stats of the Annotation packages in the Bioconductor project.
Here we are going to analyse the Annotation packages of Bioconductor. See the home of the analysis here.
First we read the latest data from the Bioconductor project. There are two files: one with the download stats from 2009 until today, and another with the download stats of the software packages. We will only use the first one:
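library("data.table") # provides the stats[i, j, by] syntax used below
library("ggplot2") # used for all the plots
load("stats.RData")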
stats <- stats[Category == "Annotation", ]
stats
## Package Year Month Nb_of_distinct_IPs
## 1: BSgenome.Dmelanogaster.UCSC.dm3 2014 08 1
## 2: BSgenome.Dmelanogaster.UCSC.dm3 2017 01 74
## 3: BSgenome.Dmelanogaster.UCSC.dm3 2017 02 163
## 4: BSgenome.Dmelanogaster.UCSC.dm3 2017 03 135
## 5: BSgenome.Dmelanogaster.UCSC.dm3 2017 04 125
## 6: BSgenome.Dmelanogaster.UCSC.dm3 2017 05 175
## 7: BSgenome.Dmelanogaster.UCSC.dm3 2017 06 162
## 8: BSgenome.Dmelanogaster.UCSC.dm3 2016 01 125
## 9: BSgenome.Dmelanogaster.UCSC.dm3 2016 02 221
## 10: BSgenome.Dmelanogaster.UCSC.dm3 2016 03 181
## ---
## 104967: MafDb.gnomADex.r2.0.1.hs37d5 2017 05 7
## 104968: MafDb.gnomADex.r2.0.1.hs37d5 2017 06 8
## 104969: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017 02 6
## 104970: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017 03 3
## 104971: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017 04 3
## 104972: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017 05 14
## 104973: SNPlocs.Hsapiens.dbSNP149.GRCh38 2017 06 15
## 104974: TxDb.Ggallus.UCSC.galGal5.refGene 2017 04 3
## 104975: TxDb.Ggallus.UCSC.galGal5.refGene 2017 05 15
## 104976: TxDb.Ggallus.UCSC.galGal5.refGene 2017 06 11
## Nb_of_downloads Category Date
## 1: 1 Annotation 2014-08-01 02:00:00
## 2: 142 Annotation 2017-01-01 01:00:00
## 3: 285 Annotation 2017-02-01 01:00:00
## 4: 225 Annotation 2017-03-01 01:00:00
## 5: 207 Annotation 2017-04-01 02:00:00
## 6: 259 Annotation 2017-05-01 02:00:00
## 7: 304 Annotation 2017-06-01 02:00:00
## 8: 186 Annotation 2016-01-01 01:00:00
## 9: 297 Annotation 2016-02-01 01:00:00
## 10: 262 Annotation 2016-03-01 01:00:00
## ---
## 104967: 7 Annotation 2017-05-01 02:00:00
## 104968: 9 Annotation 2017-06-01 02:00:00
## 104969: 6 Annotation 2017-02-01 01:00:00
## 104970: 4 Annotation 2017-03-01 01:00:00
## 104971: 3 Annotation 2017-04-01 02:00:00
## 104972: 18 Annotation 2017-05-01 02:00:00
## 104973: 17 Annotation 2017-06-01 02:00:00
## 104974: 4 Annotation 2017-04-01 02:00:00
## 104975: 18 Annotation 2017-05-01 02:00:00
## 104976: 11 Annotation 2017-06-01 02:00:00
There have been 2571 Annotation packages in Bioconductor. Some were added early on and some more recently.
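A count like this one can be obtained by counting the distinct package names in the download table. The following is a minimal sketch in base R on a small hypothetical table (`toy` and its values are made up for illustration and are not the real `stats` object):

```r
# Toy stand-in for the `stats` table (hypothetical values, not real data)
toy <- data.frame(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  Nb_of_downloads = c(10, 12, 3, 7, 9),
  stringsAsFactors = FALSE
)

# Number of distinct packages that appear at least once in the table
n_packages <- length(unique(toy$Package))
n_packages
```

With the real data.table object, an equivalent call would presumably be `stats[, uniqueN(Package)]`.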
First we explore the number of packages being downloaded by month:
theme_bw <- theme_bw(base_size = 16)
scal <- scale_x_datetime(date_breaks = "3 months")
ggplot(stats[, .(Downloads = .N), by = Date], aes(Date, Downloads)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages downloaded") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scal +
xlab("")
Figure 1: Packages in Bioconductor with downloads
The number of packages being downloaded increases almost exponentially with time, which is partially explained by the incorporation of new packages.
ggplot(stats[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
xlab("")
Figure 2: Downloads of packages
Even if the number of packages increases exponentially, the number of downloads has grown linearly with time since 2011. This indicates that each package must compete with more and more packages to be downloaded.
pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads),
ymin = mean(Nb_of_downloads)-1.96*sd(Nb_of_downloads)/sqrt(.N),
ymax = mean(Nb_of_downloads)+1.96*sd(Nb_of_downloads)/sqrt(.N)),
by = Date], aes(Date, Number)) +
geom_errorbar(aes(ymin = ymin, ymax = ymax), width=.1, position=pd) +
geom_point() +
geom_line() +
theme_bw +
ggtitle("Downloads") +
ylab("Mean download for a package") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
xlab("")
Figure 3: Downloads of packages per package
The error bar indicates the 95% confidence interval.
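The error bars follow the usual normal approximation, mean ± 1.96 × sd / sqrt(n). As a minimal sketch of that calculation outside of ggplot2, assuming a hypothetical helper `ci95()` and made-up download counts:

```r
# Normal-approximation 95% confidence interval for a mean
ci95 <- function(x) {
  m <- mean(x)
  se <- sd(x) / sqrt(length(x)) # standard error of the mean
  c(mean = m, ymin = m - 1.96 * se, ymax = m + 1.96 * se)
}

# Hypothetical download counts of five packages in one month
downloads <- c(10, 20, 30, 40, 50)
ci95(downloads)
```

Download counts are usually strongly skewed, so this interval should be read as a rough guide to the dispersion rather than as an exact coverage statement.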
Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now less dispersion between package downloads.
This might be due to an increase in the usage of packages, or to new packages bringing more users. We start by finding out how many packages have been introduced in Bioconductor each month.
today <- Sys.Date()
year <- format(today, "%Y") # current year, e.g. "2017"
month <- format(today, "%m") # current month, zero-padded like the Month column
incorporation <- stats[ , .SD[which.min(Date)], by = Package, .SDcols = "Date"]
histincorporation <- incorporation[, .(Number = .N), by = Date, ]
ggplot(histincorporation, aes(Date, Number)) +
geom_bar(stat="identity") +
theme_bw +
ggtitle("Packages with first download") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
xlab("")
Figure 4: New packages
We can see that there were more than 60 packages in Bioconductor before 2009, and since then there are occasional rises of up to 10 first downloads in a month (which would be new packages being added).
ggplot(histincorporation, aes(Date, Number)) +
geom_bar(stat="identity") +
theme_bw +
ggtitle("Packages with first download") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
xlab("") +
ylim(c(0, 20))
## Warning: Removed 15 rows containing missing values (position_stack).
Figure 5: New packages
Close-up view of the new packages not previously downloaded.
Removed
Using a similar procedure we can approximate the packages deprecated and removed each month. In this case we look for the last date a package was downloaded, excluding the current month:
deprecation <- stats[, .SD[which.max(Date)], by = Package, .SDcols = c("Date", "Year", "Month")]
deprecation <- deprecation[!(Month == month & Year == year), ] # Exclude packages last seen in the current month
histDeprecation <- deprecation[, .(Number = .N), by = Date, ]
ggplot(histDeprecation, aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages without downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Last seen packages") +
xlab("")
Figure 6: Date where a package was last downloaded
Approximates the date when packages were removed from Bioconductor.
Here we can see the packages whose last download was in a given month, assuming that this means they are deprecated. A package may no longer be downloaded and still be in the Bioconductor repository; this would be the reason for the spike to 3000 packages in the last month. In total there are 1128 packages downloaded. We further explore how much time passes between the incorporation of a package and its last download.
df <- merge(incorporation, deprecation, by = "Package")
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days"))/365 # Transform to years
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Years")
abline(v = mean(timeBioconductor), col = "red")
abline(v = median(timeBioconductor), col = "green")
Time of packages between first and last download. The red line marks the mean and the green line the median.
Packages tend to stay up to 10 years. Not surprisingly, no package incorporated before 2009 is still in the repository. But how do the packages that have not been removed fare in Bioconductor?
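The time between the first and last download can also be worked out from the raw dates. The sketch below is a base R illustration on a hypothetical log (`toy` and its dates are made up); it converts the dates to seconds explicitly so that the unit of the difference is never ambiguous:

```r
# Hypothetical monthly download log (not real Bioconductor data)
toy <- data.frame(
  Package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  Date = as.POSIXct(c("2015-01-01", "2016-01-01", "2017-01-01",
                      "2016-06-01", "2016-12-01"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# First and last month in which each package was downloaded, in seconds
first_seen <- tapply(as.numeric(toy$Date), toy$Package, min)
last_seen <- tapply(as.numeric(toy$Date), toy$Package, max)

# Seconds to years, using a 365-day year as in the analysis
years_seen <- (last_seen - first_seen) / (60 * 60 * 24 * 365)
round(years_seen, 2)
```

Subtracting two date-time vectors directly returns a `difftime` whose unit is chosen automatically, which is why the conversion to seconds is made explicit here.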
We can start by comparing the number of downloads (different from 0) with the number of IPs downloading each package.
ggplot(stats, aes(Nb_of_distinct_IPs, Nb_of_downloads, col = Package)) +
geom_point() +
theme_bw +
geom_smooth(method = "lm") +
xlab("Number of distinct IPs") +
ylab("Number of downloads") + # the axis is linear, not log10
ggtitle("Downloads by different IP") +
geom_abline(slope = 2) +
guides(col = FALSE)
Figure 7: Downloads and distinct IPs of all months and packages
Each color is a package, the black line represents 2 downloads per IP.
Not surprisingly, most packages have two downloads from the same IP, one for each Bioconductor release (black line). However, for some packages a few IPs download the same package many times, which may indicate that these packages are mostly installed in a few locations.
ratio <- stats[, .(slope = coef(lm(Nb_of_downloads~Nb_of_distinct_IPs))[2]), by = Package]
ratio <- ratio[order(slope, decreasing = TRUE), ]
ratio <- ratio[!is.na(slope), ]
ratio$Package <- as.character(ratio$Package)
ratio
## Package slope
## 1: BSgenome.Hsapiens.NCBI.GRCh38 7.3759554
## 2: hgu95aprobe 6.0319987
## 3: org.EcK12.eg.db 5.3779110
## 4: hs133bptentrezg 5.2747253
## 5: pd.hg.u133a.2 5.2467415
## 6: hs133phsenst 5.2105263
## 7: BSgenome.Celegans.UCSC.ce2 4.6543039
## 8: mirna10cdf 4.5150913
## 9: hs133xptenstcdf 4.4647887
## 10: hs133xptense 4.4615385
## ---
## 2511: BSgenome.Ggallus.UCSC.galGal5 0.8140669
## 2512: BSgenome.Mmulatta.UCSC.rheMac8 0.7919463
## 2513: IlluminaHumanMethylation27kanno.ilmn12.hg19 0.7409949
## 2514: mta10probeset.db 0.7306502
## 2515: hgug4845a.db 0.7126742
## 2516: BSgenome.Ptroglodytes.UCSC.panTro5 0.6421801
## 2517: alternativeSplicingEvents.hg19 0.6264259
## 2518: mta10transcriptcluster.db 0.4765625
## 2519: AHEnsDbs 0.4473684
## 2520: hgfocusprobe 0.3349607
We can see that the package with the most downloads per IP is BSgenome.Hsapiens.NCBI.GRCh38, followed by hgu95aprobe, org.EcK12.eg.db and, in fourth place, hs133bptentrezg.
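The slope can be read as the number of extra downloads per additional distinct IP. To make that concrete, here is a minimal sketch in base R with two hypothetical packages (`pkgA`, `pkgB` and their counts are made up), fitting the same linear model per package:

```r
# Hypothetical monthly counts for two packages (not real data)
toy <- data.frame(
  Package = rep(c("pkgA", "pkgB"), each = 4),
  Nb_of_distinct_IPs = c(1, 2, 3, 4, 1, 2, 3, 4),
  Nb_of_downloads = c(2, 4, 6, 8, 5, 10, 15, 20),
  stringsAsFactors = FALSE
)

# Fit downloads ~ IPs separately for each package and keep the slope
slopes <- sapply(split(toy, toy$Package), function(d) {
  unname(coef(lm(Nb_of_downloads ~ Nb_of_distinct_IPs, data = d))[2])
})
slopes
```

In this toy data `pkgA` gets two downloads per IP, the pattern expected from one download per release, while `pkgB` gets five per IP, the kind of value that would point to repeated installs from a few locations.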
Now we explore whether there are seasonal cycles in the downloads, as there seem to be some cycles in figure ??.
First we can explore the number of IPs per month downloading each package:
ggplot(stats, aes(Date, Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Distinct IP downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 8: Distinct IP per package
As we can see, in 2009 there are two groups of packages: some with a low number of IPs and some with a larger number. As time progresses the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per package") +
ylab("Downloads") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 9: Downloads per year
Surprisingly, some packages have big bursts of up to 400k downloads, others of just 100k downloads. But let's focus on the lower end:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per package every three months") +
ylab("Downloads") +
scal +
ylim(0, 10000)+
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 10: Downloads per year
There are many packages close to 0 downloads each month, but most packages have fewer than 10000 downloads per month:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw+
ggtitle("Downloads per package every three months") +
ylab("Downloads") +
scal +
ylim(0, 2500)+
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
## Warning: Removed 108 rows containing missing values (geom_path).
Figure 11: Downloads per year
As we can see, in general the month of the year also influences the number of downloads. So from 2010 onwards the factors influencing the downloads are the year and the month.
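One way to look at the month effect separately from the yearly trend would be to average the downloads of each calendar month over the years. This is only a sketch of that idea in base R, on made-up monthly totals (`toy` is hypothetical and this step is not part of the original analysis):

```r
# Hypothetical monthly totals over two years (not real data)
toy <- data.frame(
  Year = rep(c(2015, 2016), each = 12),
  Month = rep(sprintf("%02d", 1:12), times = 2),
  Downloads = c(100, 110, 120, 115, 110, 100, 80, 60, 105, 120, 115, 90,
                130, 140, 150, 145, 140, 130, 100, 80, 135, 150, 145, 120),
  stringsAsFactors = FALSE
)

# Average over the years for each calendar month
by_month <- aggregate(Downloads ~ Month, data = toy, FUN = mean)

# Calendar month with the lowest average number of downloads
by_month[which.min(by_month$Downloads), ]
```

If the same months came out low year after year in the real data, that would support a seasonal component on top of the yearly growth.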
Maybe there is a relationship between the downloads and the number of IPs per date:
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Ratio") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE)
Figure 12: Ratio downloads per IP per package
We can see that some packages have occasional rises in downloads per IP. But at this scale most packages cannot be told apart, so we restrict the range:
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Ratio") +
scal +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
guides(col = FALSE) +
ylim(1, 5)
Figure 13: Ratio downloads per IP per package
But most of the packages seem to be more or less constant and around 2.
One problem when comparing the evolution of the packages is that they started at different moments and, as seen, both the number of downloads and the number of packages have been increasing with time. So we need to normalize the starting dates:
norm <- stats[, .(Norm = as.numeric(Date)/as.numeric(max(Date)),
Downloads = Nb_of_downloads/max(Nb_of_downloads)), by = Package]
ggplot(norm, aes(Norm, Downloads, col = Package)) +
geom_line() +
theme_bw() +
ggtitle("Downloads per stage of the package") +
xlab("Date normalized") +
guides(col = FALSE)
Figure 14: Normalization of dates and downloads
We can observe a tendency for the number of downloads to decrease after a package is included in Bioconductor and to rise again later.
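Dates are stored as seconds since 1970, so dividing a date by the latest date of the package keeps all values in a narrow range just below 1 and does not move the first download to 0. An alternative, sketched here as an assumption and not as the method used for Figure 14, is min-max rescaling, which maps the first download of each package to 0 and the last to 1. The helper name `rescale01()` is hypothetical:

```r
# Rescale a numeric vector to the range [0, 1]
rescale01 <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x))) # single time point
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical package first seen in 2015 and last seen in 2017
dates <- as.POSIXct(c("2015-01-01", "2016-01-01", "2017-01-01"), tz = "UTC")
rescaled <- rescale01(as.numeric(dates))
rescaled
```

Applied per package, this would put packages of very different ages on the same 0 to 1 axis, at the cost of hiding how long each package has actually been available.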
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.10.4 ggplot2_2.2.1 BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 knitr_1.15.1 magrittr_1.5 munsell_0.4.3
## [5] colorspace_1.3-2 stringr_1.2.0 highr_0.6 plyr_1.8.4
## [9] tools_3.4.0 grid_3.4.0 gtable_0.2.0 htmltools_0.3.6
## [13] yaml_2.1.14 lazyeval_0.2.0 rprojroot_1.2 digest_0.6.12
## [17] tibble_1.3.0 bookdown_0.3 evaluate_0.10 rmarkdown_1.5
## [21] labeling_0.3 stringi_1.1.5 compiler_3.4.0 scales_0.4.1
## [25] backports_1.0.5